Discovering comorbid diseases using an inter-disease interactivity network based on biobank-scale PheWAS data

Abstract Motivation Understanding comorbidity is essential for disease prevention, treatment and prognosis. In particular, insight into which pairs of diseases are likely or unlikely to co-occur may help elucidate the potential relationships between complex diseases. Here, we introduce the use of an inter-disease interactivity network to discover/prioritize comorbidities. Specifically, we determine disease associations by accounting for the direction of effects of genetic components shared between diseases, and categorize those associations as synergistic or antagonistic. We further develop a comorbidity scoring algorithm to predict whether diseases are more or less likely to co-occur in the presence of a given index disease. This algorithm can handle networks that incorporate relationships with opposite signs. Results We finally investigate inter-disease associations among 427 phenotypes in UK Biobank PheWAS data and predict the priority of comorbid diseases. The predicted comorbidities were verified using the UK Biobank inpatient electronic health records. Our findings demonstrate that considering the interaction of phenotype associations might be helpful in better predicting comorbidity. Availability and implementation The source code and data of this study are available at https://github.com/dokyoonkimlab/DiseaseInteractiveNetwork. Supplementary information Supplementary data are available at Bioinformatics online.


Introduction
Comorbidity describes when a given patient has one or more additional medical conditions co-occurring alongside a primary disease (Lee et al., 2008;Valderas et al., 2009). Predicting comorbidity is important for planning the clinical care of individual patients and for investigating clinical epidemiology; in particular, mortality risk is often increased when patients with underlying diseases are also diagnosed with adverse medical outcomes or complications (Cho et al., 2021;Jørgensen et al., 2012;Nashiry et al., 2021).
Comorbidity studies involving multiple phenotypes at the population level can be divided into two categories according to the approach employed: (i) statistical analysis based on disease prevalence, or (ii) network analysis based on sharing of components among diseases. Notable studies using the statistical approach include Buddeke et al. (2019), a cohort study based on clinical records that presented complications with cardiovascular diseases (Buddeke et al., 2019;Klimek et al. (2015), which used nationwide claim data to identify diabetes-related comorbidities (Klimek et al., 2015); and Richardson et al. (2020), which revealed conditions comorbid with SARS-CoV-2 among 5700 hospitalized patients (Richardson et al., 2020). Similarly, studies using network-based approaches and disease-disease associations include: Hidalgo et al. (2009), which built comorbidity maps for 657 diseases from the clinical records of over 30 million patients, with connectivity defined by prevalence-based comorbidity measures (Hidalgo et al., 2009;Rubio-Perez et al. (2017), which explored potential comorbidity relationships by constructing a disease-disease network (DDN) based on shared diseaseassociated genes and their functions (Rubio-Perez et al., 2017;Verma et al. (2019), which similarly identified disease comorbidities on the basis of shared genetic components by constructing a DDN based on a phenome-wide association study (PheWAS) that used electronic health record (EHR)-linked biobank data (Verma et al., 2019;Dong et al., 2021), which discovered disease multimorbities using hospital inpatient data in the UK Biobank and provided biological explanations by constructing disease network with a set of genome-wide association studies (GWASs) data (Dong et al., 2021).
Separately, several studies have attempted to categorize comorbidity as being direct or inverse (Ibáñez et al., 2014;Sá nchez-Valle et al., 2020;Tabarés-Seisdedos and Valderas, 2013;Tarantino et al., 2018); i.e. given an underlying disease, conditions that frequently co-occur at the population level are considered directly comorbid, while those with relatively few co-occurrences have inversely comorbid. Catalá -Ló pez et al. discovered that Alzheimer's disease can decrease the co-occurrence of cancers from meta-analysis with literatures (Catalá -Ló pez et al., 2014). Roitmann et al. revealed inverse comorbidity relationships from clustering analysis with health records in Demark (Roitmann et al., 2014).
DDNs represent a map of topologies between phenotypes having shared biological components that may indicate phenotypic similarity and comorbidity. Several DDNs were constructed to provide insight into the human disease interactome by using genetic components shared with diseases, such as genes, proteins, or pathways (Barabási et al., 2011;Goh et al., 2007;Liu et al., 2014;Zhou et al., 2014). Other studies have used summary statistics obtained from GWASs to observe disease-disease associations by leveraging disease-associated single nucleotide polymorphisms (SNPs) (Darabos et al., 2015;Dong et al., 2021;Nam et al., 2022;Verma et al., 2019). However, these previous association-based DDN studies have not considered whether the associations or interactions that depend on shared components are agonistic or antagonistic. As shared genetic components can have different effects on different disease mechanisms, incorporating this directional information can improve the ability to predict comorbidity. For example, cardiovascular diseases have a synergistic association with low-density lipoprotein cholesterol (LDL-C) and an antagonistic association with high-density lipoprotein cholesterol (HDL-C): high LDL-C may increase the risk of cardiovascular disease, while high HDL-C may reduce it (Di Angelantonio et al., 2009;Sharrett et al., 2001). Insight into which disease pairs are likely or unlikely to co-occur may help in understanding the potential relationships of complex diseases. Moreover, as our knowledge of biological mechanisms advances, further genetic components of risk factors or protective factors related to disease mechanisms are revealed (von Mutius and Smits, 2020;Zhu et al., 2018). Notably, Bulik-Sullivan et al. (2015) successfully identified negative genetic correlations that agreed with epidemiological associations (negatively correlated traits co-occur less than expected by chance), and also identified phenotypic correlations between complex traits and diseases via LD Hub and PhenoSpD as a follow-up study (Bulik-Sullivan et al., 2015;Zheng et al., 2017Zheng et al., , 2018. They utilized linkage disequilibrium score (LDSC) regression to estimate the strength of genetic correlations between diseases by comparing association test statistics between two diseases according to LD score trait in a polygenic model. Therefore, LDSC-based methods effectively calculate genetic correlations mainly for polygenic traits, using all variants regardless of their statistical significance in GWAS (Bulik-Sullivan et al., 2015). Because quantifying genetic correlations using LDSC regression focused on the entire spectrum of variants in GWASs, it is hard to directly apply their method to estimate disease interactions based on knowledge-based shared genetic components (e.g. disease-gene associations cataloged in the Online Mendelian Inheritance in Man database) or data-driven shared components with statistical thresholds (e.g. based on significant variants in genome-wide data).
The association-based DDNs that consider only the total amount of shared components are not easy to determine whether the relationships between diseases are agonistic or antagonistic. In contrast, the correlation-based DDNs computed by LDSC regression can determine disease interactions but it may include more false-positive disease associations in the resulting networks than those considering only significant SNPs. This motivates us to explore interactions between diseases considering the direction of effect of shared genetic components that reach genome-wide significance in PheWAS summary statistics and to study how inter-disease interactivity is related at the epidemic and genomic levels.
Here, we propose a novel network-based framework to build an inter-disease interaction network that can take into account the direction of effects for components shared between diseases. We consider phenotypes as having either synergistic or antagonistic association: synergistic if two diseases mutually increase the tendency of their co-occurrence, or antagonistic if they decrease the tendency of co-occurrence. To examine whether the interactivity of disease associations can help predict/prioritize comorbidity, we constructed a DDN that considers the direction of effect of SNPs significantly associated with diseases, obtained from PheWAS summary statistics. This inter-disease interactivity network is a signed diseasedisease network (signed DDN) having both positive and negative edges in the graph (i.e. positive values for a synergistic association and negative for an antagonistic association). To translate synergistic and antagonistic association at the genomic level to direct and inverse comorbidity at an epidemic level, we also develop a novel graph-based semi-supervised learning (SSL) for comorbidity scoring, modifying the objective functions of label propagation algorithms to work for a signed DDN with a unary label by introducing a signed degree matrix and signed graph Laplacian (Gallier, 2016). The scoring algorithm prioritizes comorbid diseases in relation to an index disease of interest.

Materials and methods
The overall procedure consists of three main parts ( Fig. 1): (i) constructing the signed DDN based on shared genetic components, (ii) predicting comorbidity scores using graph-based SSL and (iii) prioritizing/ranking comorbid conditions in relation to a given disease. We constructed a signed DDN with pre-defined/pre-calculated disease associations. (c) Prioritizing comorbidity. Higher scores mean higher chance of disease co-occurrence; D5 has the highest chance of comorbidity with D2 First, we built a signed DDN based on the sharing of significant SNPs between disease pairs, which relied on data obtained from UK Biobank PheWAS summary statistics. We determined disease associations by accounting for the overall direction of effects of genetic components shared between diseases, and categorize those associations as synergistic or antagonistic. In this network, two diseases have a synergistic association if their shared SNPs have the overall same direction of effect, or an antagonistic association if the SNP effects are in opposition. For example, Disease 1 (D1) and Disease 2 (D2) have shared components in SNP1 and SNP2, whereas D2 and D3 share SNP4 (Fig. 1a). Both SNP1 and SNP2 display the same direction of effect on D1 and on D2: SNP1 is associated with positive direction of effects (increasing risk), while SNP2 is associated with negative direction of effects (decreasing risk). Thus, the association between D1 and D2 is synergistic. In contrast, D2 and D3 have an antagonistic association because the respective effect of SNP4 is opposite (i.e. positive for D2 but negative for D3). Once the associations are determined, a signed DDN is constructed to represent the relationships between phenotype pairs as a graph with nodes and edges (Fig. 1b).
Second, we perform comorbidity score prediction: given a specific disease, we predict comorbidity of other diseases by applying graph-based SSL to the constructed network (Lee et al., 2020;Nam et al., 2019a,b). Since the network includes both synergistic and antagonistic associations (edge weights with positive and negative values), we posited the following hypotheses concerning comorbidity predictions: (i) two diseases have a chance of comorbidity if they are connected (i.e. have at least one shared SNP), regardless of association direction; (ii) if two diseases have a synergistic association according to overall direction of shared SNPs, they are more likely to co-occur (high chance of co-occurrence); and (iii) if two diseases have an antagonistic association, they are less likely to co-occur (low chance of co-occurrence). If these hypotheses are supported, scoring algorithms can predict the comorbidity of another medical condition given a specific underlying disease.
Finally, we prioritize the chance of disease co-occurrence based on the predicted comorbidity scores, stratifying by deciles. All diseases in a network have some chance of comorbidity with the index disease, but the degree of association can range considerably; a representative example is illustrated in Figure 1c. This ranking is then validated using disease prevalence-based co-occurrence measures estimated from EHRs (Hidalgo et al., 2009).

Constructing the DDN from PheWAS summary data
The proposed DDN was built with a shared genetic component hypothesis, namely that two different phenotypes were linked if they shared the significant SNPs from the PheWAS summary statistics. We obtained the beta-coefficient (b ik ) and standard error (SE ik ) values for the association between phenotype i and SNP k when that association passed the P-value threshold for significance in summary statistics. Then, we selected an appropriate significance threshold (P-value < 1 Â 10 À4 ) and constructed a disease-SNP association matrix: specifically, given m phenotypes and k SNPs, we constructed the association matrix R 2 R mÂk (Fig. 2a). For each SNP k associated with disease i, the element r ik ¼ b ik =SE ik is a constant value of z-score consisting of the corresponding beta-coefficient divided by the standard error. The sign of r ik indicates the direction of effect of SNP k in relation to disease i. The DDN was then constructed from this matrix to represent the genetic associations between pairs of phenotypes by measuring proximity between disease-SNP vectors.
This DDN is an undirected, signed and weighted graph G ¼ :; mg denotes the set of diseases and W ¼ w ij f j i; j ¼ 1; 2 . . . ; mg denotes the similarity between diseases. Each disease v i ¼ fr i1 ; r i2 ; . . . ; r ik g has k-dimensional SNP vectors, and the similarity w ij between any two diseases v i and v j is calculated as follows: (1) where r ik and r jk are the respective z-scores associated with SNP k for diseases v i and v j ; K is the total number of SNPs; and S is the set of significant SNPs shared between v i and v j . Since the z-score r i can take a negative value, each edge weight w ij can range from À1 to 1. Figure 2a illustrates the constructed disease similarity matrix (W) with both positive and negative associations. The magnitude of a weight ( w ij j j ) between a disease pair describes the number of significant SNPs those diseases share. A synergistic association (w ij > 0) means that two diseases have shared SNPs and the overall effect of those SNPs has the same direction for both diseases. An antagonistic association (w ij < 0) indicates that the shared SNPs have overall opposite directions of effect for the two diseases. The similarity matrix can quantify how similar two diseases are in terms of the directions of SNP effects.

Network-based comorbidity scoring algorithms
Once the DDN incorporating synergistic and antagonistic associations (signed DDN) was constructed, comorbidity scoring was performed using a graph-based SSL, which is employed for scoring algorithms (Chong et al., 2020;Subramanya and Talukdar, 2014). Figure 2b describes the problem setup for predicting comorbidity scores. The network-based scoring algorithm used here is a transductive learning approach (Chong et al., 2020); it predicts the cooccurrences between an underlying disease and other diseases when only the underlying disease is known to occur (one positive sample), and all others have unknown occurrences (unlabeled samples). Those scores can provide prioritized comorbidity scores for the unlabeled diseases by propagating with disease associations from the signed DDN. The proposed network simultaneously contains both positive and negative edges, obtained from (1); i.e. two diseases having SNPs in common can co-occur regardless of the respective directionality of SNP effects. Nevertheless, the chance of co-occurrence for two synergistically associated diseases may be relatively higher than that of two diseases having an antagonistic association, since the former pair shares many SNPs with the same direction of effect while the latter features SNPs with opposite directions of effect.
The following describes the procedure of comorbidity scoring. Consider a signed DDN G ¼ ðV; W Þ, where the similarity matrix W has both positive and negative edge weights. Let y ¼ y 1 ; . . . ; y m ð Þ T denotes the initial labels for the set of diseases and f ¼ Þ T the set of predicted comorbidity scores. We set a unary label (y l ¼ þ1) for an index disease of interest (v l ) and set unlabeled (y y l ¼ f0gÞ for remaining other diseases. Label information on the labeled disease is propagated to unlabeled diseases on graph G to obtain real-valued scores f , with two assumptions: the smoothness condition (predicted scores f i should not be too different from the f j values in adjacent unlabeled nodes) and the loss condition (predicted scores f i should be close to the given label of y i ). Generally, for an unsigned graph, G $ ¼ ðV; W $ Þ, the smoothness condition is represented as where the graph Laplacian L is defined as being the unsigned similarity matrix, D ¼ diagðd i Þ the degree matrix and d i ¼ P j w ij . However, since the proposed network has signed edges from (1), we modified the smoothness condition of (2) to handle both positive and negative associations by incorporating a signed degree matrix: where L is the signed graph Laplacian, defined as L ¼ D À W, in which D ¼ diagðd i Þ is the signed degree matrix with d i À ¼ P j w ij j j and signðÁÞ is the signum function. The signed graph Laplacian L is positive semidefinite, and the quantity of f T Lf is non-negative (Gallier, 2016). The loss condition is f À y The predicted output f is obtained by minimizing the following quadratic function using the loss and smoothness conditions: The closed form solution is obtained as where the hyper-parameter l trades off loss and smoothness. The resulting predicted outcome f on unlabeled nodes has either positive or negative values. We denote the predicted comorbidity scores f as the DDN-driven comorbidity, which indicates the relative likelihood of two diseases co-occurring based on the direction of effects of genetically associated factors.

Prioritizing and categorizing comorbidity
We prioritize and categorize the list of co-occurring diseases as having higher or lower chance of co-occurrence. Hereafter, we denote the higher chance of comorbidity as direct comorbidity and the lower chance as inverse comorbidity (Catalá -Ló pez et al., 2014; Tabarés-Seisdedos and Baudot, 2016). We categorize DDN-driven comorbidity f into direct and inverse comorbidity. For validating DDN-driven comorbidity, we used co-occurrence measures estimated from EHR (denoted as EHR-driven comorbidity). EHRdriven comorbidity is defined in terms of the disease prevalencebased /-correlation and relative risk (Hidalgo et al., 2009). The Pearson correlations for binary variables (/-correlation) and relative risk were used to estimate EHR-driven comorbidity (Fig. 2c). Given N patients and the list of m diagnoses in clinical records, we constructed a ( m 2 Àm 2 ) binary contingency matrix for each disease pair to display the frequency distribution. The /-correlation for the pair of diseases i and j is calculated by / ij ¼ CijNÀPiPj ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ffi PiPjðNÀPiÞðNÀPjÞ p based on their prevalence, where C ij is the number of patients diagnosed with both diseases while P i and P j indicate the number of patients diagnosed with disease i and j, respectively. A positive correlation means two diseases tend to occur simultaneously and a negative correlation that they tend not to co-occur. We also captured tendency of co-occurrence among disease pairs by calculating relative risks. The relative risk for two diseases RR ij ¼ CijN PiPj is measured as the ratio of risk between the diseases based on the disease prevalence. RR ij > 1 indicates that both diseases co-occur more frequently than expected by chance, and RR ij < 1 indicates their less frequent co-occurrence than expected by chance (Hidalgo et al., 2009;Sá nchez-Valle et al., 2020). A disease pair i and j was defined as having EHR-driven comorbidity when both measures indicated a tendency to co-occur (/ ij > 0 and RR ij > 1).
Taking as ground truths the corresponding EHR-driven comorbidity, DDN-driven comorbidity scores f from Equation (5) for each unlabeled disease can be decomposed into f þ and f À based on a threshold value. Suppose thatŷ ¼ŷ 1 ; . . . ;ŷ m ð Þ T is the ground truths, thenŷ i can take a value of 'þ1' if disease i is defined as having EHR-driven comorbidity, and 'À1' otherwise. Then, the receiver operating characteristic (ROC) curve was analyzed to determine a threshold value J for deciding the labels on f . Youden's J statistic was used for a threshold value where the maximum value of J ¼ sensitivity þ specificity À 1, as in a ROC curve (Schisterman et al., 2005). We categorized diseases as having DDN-driven direct comorbidity (high chance of co-occurrence) when f þ ¼ f i f j f i ! J; for 8ig and DDN-driven inverse comorbidity (low chance of co-occurrence) when f À ¼ f i f j f i < J; for 8ig.

Data processing for network construction and validation
Disease-SNP associations were generated from UK Biobank PheWAS summary statistics based on EHR-derived broad phenotype codes (PheCode) (Wei et al., 2017;Wu et al., 2019). The PheWAS summary statistics were obtained from https://www.lee labsg.org/resources, and include 1403 phenotypes , of which 976 were excluded in this study due to having a relatively low number of cases (<1000) or representing injuries and poisonings, symptoms, or sex-specific disease. Ultimately, network construction and comorbidity prediction were performed on 427 diseases belonging to 14 phenotype categories (Fig. 3). Preprocessing of the PheWAS summary results was performed using the following steps: (i) 588 711 independent SNPs were selected by linkage disequilibrium pruning with a threshold (window size: 50 kb, step size: 5 kb, and r 2 threshold: 0.5), and (ii) SNPs with association P-value < 1 Â 10 À4 were selected, yielding 39 382 SNPs for analysis. Disease-SNP associations were represented as a (427 Â 39 382)-dimensional matrix in which the association of each element pair was represented by a z-score. This matrix was then used to construct a DDN with signed associations using Equation (1). EHR-driven comorbidity information was collected and generated from the UK Biobank hospital episode statistic database. Disease diagnoses for 502 505 UK Biobank participants are represented by ICD-9 and ICD-10 codes, which were mapped to PheCodes by referring to PheCode Map versions 1.2 and 1.2b1 (https://phewascatalog.org/). A patient-phenotype matrix was constructed with binary values indicating the diagnosis status of each participant for 427 diseases to calculate /-correlation and relative risk. More details about the UK Biobank PheWAS results are provided in Supplementary Text S1, and the phenotypes used in this analysis are listed in Supplementary Data S1.

Construction of the inter-disease interactivity network (signed DDN)
The resulting inter-disease interactivity network (Fig. 3a) consisted of 427 phenotypes (nodes) and 9223 total associations (edges), of which 6209 were synergistic (67%; red) and 3014 antagonistic (33%; blue) ( Fig. 3c and Supplementary Figs S1 and S2). We examined the distribution of associations across the 14 phenotype categories (Fig. 3b), and found connections between diseases within a phenotype category to typically be synergistic. Namely, diseases belonging to a given phenotype category often had consistent effect directions for their shared SNPs, which means that the chance of their co-occurrence may be higher. One feature of signed DDNs is the ability to observe genetic associations between diseases as the direction of effect of shared SNPs. Our signed DDN was constructed while taking into account both the total amount and direction of shared SNPs. The greater the number of shared SNPs, the stronger the association between two diseases; likewise, the more homogenous the direction of effect, the stronger the synergistic association. We examined the composition of synergistic/antagonistic associations in our signed DDN. Of the 427 diseases in the network, 364 feature both associations, 57 have only synergistic associations, 3 have only antagonistic associations and, remaining 3 diseases have no associations (Supplementary Data S2, https://hdpm.biomedinfo lab.com/ddn/signedDDN). To investigate a deeper biological interpretations of the network, we described the composition of phenotypes and interactions in signed DDN for coronary atherosclerosis (PheCode: 411.4) as case example (Supplementary Text S2). For example, Coronary atherosclerosis and type 2 diabetes (PheCode: 250.2) shared 66 SNPs, of which 55 SNPs were associated with positive direction of effects and 11 with negative direction of effects in relation to both diseases. By calculating the overall direction of effects (similarity ¼ 0.29), coronary atherosclerosis and type 2 diabetes were determined to have a synergistic association.

Comparison of comorbidity prediction performance
The objective of network-based comorbidity scoring algorithms using a signed DDN is to predict whether other diseases are more or less likely to co-occur when one disease occurs. We hypothesized that the direction of effects of shared SNP is informative in predicting comorbidity and in prioritizing disease co-occurrence. Namely, higher scores indicate greater sharing of SNPs and more consistent direction of SNP effects; thus, diseases having stronger synergistic association with a given index disease have higher predicted  Figure S3 shows heatmap for signed DDN (b) distribution of disease associations among the 14 phenotype categories. Upper and lower diagonal elements, respectively, depict the numbers of positive and negative associations across different categories. The main diagonal represents associations among diseases belonging to the same phenotype category. (c) Summarized information on the data used for constructing the signed DDN comorbidity scores. To test this hypothesis, we compared prediction performance between the proposed method (signed DDN) and the baseline method (unsigned DDN). The unsigned DDN was constructed as in previous studies (Goh et al., 2007;Nam et al., 2019a;Verma et al., 2019), with only the number of shared genetic components being considered. The difference between signed and unsigned DDNs is explained in Supplementary Figure S4. A transductive approach was used, in which only one set of experiments was carried per disease. EHR-driven comorbidity was used for ground truths. Prediction performance was evaluated in terms of the area under the receiver operating characteristics curve (AUC) and Spearman rank correlation (q) (Fawcett, 2006;Yule, 1919). Both AUC and rank correlation for an index disease of interest were obtained by comparing the predicted DDN-driven comorbidity scores and EHRdriven comorbidity scores (Supplementary Fig. S5).
Comparison of prediction performance between the signed and unsigned DDNs revealed that the interaction of the association between phenotypes can be helpful in predicting comorbidity (Fig. 4). The AUC distribution of the 427 phenotypes in the signed DDN was shifted higher relative to that of the unsigned DDN (Fig. 4a). Moreover, the signed DDN demonstrated superior performance, achieving an average of 0.601 (versus 0.573 with the unsigned DDN) and with values generally falling above the diagonal when plotted against corresponding unsigned scores (Fig. 4b). In the same manner, we compared Spearman rank correlation coefficients between the two DDNs (Fig. 4c). In the signed DDN, 341 diseases had positive correlations (above zero on y-axis, Fig. 4c), of which 259 exhibited significant correlations (0.103-0.556, P-value < 0.05). In the unsigned DDN, 163 diseases had positive correlations (above zero on x-axis, Fig. 4c), of which 53 were statistically significant (0.105-0.290, P-value < 0.05). Only five diseases statistically significant in the unsigned DDN achieved higher positive correlation coefficients than were determined with the signed DDN. Thus, direct comparison of DDN-comorbidity score and prevalence-based relative risk indicates that taking into account the direction of effect of shared SNPs could better predict disease comorbidity. Notably, when not considering direction of effect, only weakly positive or non-significant correlations were obtained between prevalencebased relative risk and predicted scores. Both AUCs and Spearman rank correlation coefficients can be interpreted as the explanatory power of comorbidity predictions for index diseases. Performance measurement can be affected by the statistical power of PheWAS for each disease, but they are affected more greatly by the structure of the network because they are the predicted result through a scoring algorithm. Thus, an index disease with a high AUC means that disease associations explained by genetic information can be interpreted as comorbidity relationships in epidemiology.

Interpretation of the correlation coefficient in signed/unsigned DDNs
We further investigated the top 15 diseases with significant rank correlations (marked as red in Fig. 4c). DDN-comorbidity scores for these diseases in the signed DDN were highly correlated with prevalence-based relative risk (q > 0.5 with P-value < 0.05), whereas those in the unsigned DDN were relatively low (0 < q < 0.3). Overall, the top 15 included 11 circulatory-related diseases that had high positive correlation coefficients in the signed DDN (Fig. 5a). Four diseases with highly significant correlations in the signed DDN did not achieve significance in the unsigned DDN (superscript in Pvalue, Fig. 5a). It can be inferred that considering SNP direction of effect and its commonality between diseases is capable of capturing complex disease associations that might not be identified when only considering sharing of SNPs.
In this approach, synergistically associated diseases directly connected to the index disease are likely to have the highest predicted scores. Consequently, we examined which diseases had direct connections when taking each of the 15 top diseases as the index disease (Fig. 5b). The results showed all diseases with high correlation coefficients to have relatively high ratios of synergistic association. In addition, we found that comorbidity scores derived from the signed DDN were highly correlated with EHR-driven comorbidity, even for diseases having many antagonistic associations, such as hypercholesterolemia, disorders of lipoid metabolism, myocardial infarction and coronary atherosclerosis. This further supports that considering the direction of effects is helpful when using PheWAS data to analyze the complex relationships between diseases, as the latent associations between diseases (synergistic/antagonistic association) provide important information for comorbidity prediction.

Clinical implication of comorbidity scoring based on the signed DDN
We demonstrated how to predict disease comorbidity by applying a graph-based SSL for signed graph to a signed DDN encompassing 427 diseases. To provide a case study illustrating the proposed method and its interpretation, coronary atherosclerosis was selected from among the top 15 diseases having significant rank correlations in the signed DDN. Coronary atherosclerosis is a frequent cause of coronary artery disease, a common chronic condition for which disease risk is characterized by a substantial and complex polygenic contribution, with heritability between 40% and 60% (Fischer et al., 2005;Marenberg et al., 1994;McPherson and Tybjaerg-Hansen, 2016). We provided the comorbidity scores for a list of direct comorbidity diseases (Table 1), and the comorbidity score curve obtained from Equation (5) (Fig. 6a). To suggest the practical use of comorbidity scores, we divided the scores into deciles and categorized accordingly, ranging from high risk of comorbidity (direct comorbidity) to low risk of comorbidity (inverse comorbidity). We decided the Tier-1 diseases as direct comorbidity group and Tier-10 diseases as inverse comorbidity group according to their ranking of predicted scores. The remaining diseases between Tier-2 and Tier-9 were relatively lower co-occurrence than Tier-1 or higher cooccurrence than Tier-10.
We examined the sub-networks for diseases (Fig. 6b) having high risk of co-occurrence (Tier-1) and those having low risk of cooccurrence (Tier-10). All Tier-1 diseases were synergistically associated with coronary atherosclerosis and also with each other, which means that high-ranked diseases overall have shared SNPs with consistent direction of effect. In contrast, within the Tier-10 group, all diseases directly connected with coronary atherosclerosis had antagonistic associations and the predicted scores were negative. In Tier-10 group, macular degeneration (PheCode: 362.29) was antagonistically associated with coronary atherosclerosis and the predicted comorbidity was inverse. The scoring algorithm with the signed DDN detected the known relationships that high levels of HDL-C were associated with low risk of coronary artery disease and associated with high risk of age-related macular degeneration (Fan et al., 2017). A sub-network consisting of all diseases directly connected with coronary atherosclerosis is given in Supplementary Figure S6. The top 20 diseases having direct comorbidity with coronary atherosclerosis are summarized in Table 1.

Criterion of leveraging disease-SNP associations in summary statistics
We developed a signed DDN by leveraging the overall direction of the effects of genetic components shared between diseases. We also performed predictions of co-occurrence diseases given an index disease to examine whether the signed DDN can have more reliable explanatory power for disease interactions. To propagate the seed label information of index disease to the rest of the diseases in a network through the scoring algorithm, we took a less stringent genome-wide significance level (P-value < 1 Â 10 À4 ) for network construction. However, depending on the arbitrary selection of significance level, the underlying structure of the network can be changed. The disease interactions are sparser at the most stringent level and denser at the less stringent level. It is necessary to explore which P-value threshold in the disease-SNP associations for defining disease associations. We built another comparative DDN by performing LDSC regression to investigate whether comorbidity prediction results varied according to the number of SNPs in building a network. We built the genetic correlation-based DDN (denoted as LDSC-DDN) by performing LDSC regression with UK Biobank PheWAS summary data to estimate the genetic correlations of 427 pairs used for the signed DDN (Bulik-Sullivan et al., 2015). The LD scores were calculated from European samples in the 1000 Genomes Project phase 3 database (Auton et al., 2015). We considered disease pairs with positive and negative correlation values with significance (P-value < 0.05). We took the correlation matrix as the similarity matrix of the DDN and applied the proposed scoring algorithm to predict comorbidity. The LDSC-DDN had a higher density value of 19.97% (18 165 edges across 427 nodes) than the signed DDN  Fig. 3c). We performed comorbidity prediction tasks and calculated AUCs with the same experimental settings as the signed network. Empirically, the dense network can provide more accurate inferences than sparse network. However, comorbidity prediction for 98 diseases could not be conducted because they were disconnected from other diseases in the LDSC-DDN. The LDSC results cannot be obtained when applied to datasets with low statistical power of GWAS (due to small sample size or rare trait) for estimating heritability and genetic correlation. Most connections in the LDSC-DDN were obtained from diseases with high statistical power. The proposed signed DDN had advantage of discovering inter-disease interactions by leveraging significantly associated SNPs, even though the number of SNPs used for constructing network was smaller. The detailed results are provided in Supplementary Text S3.

Discussion and conclusion
We proposed a novel signed DDN based on biobank-scale PheWAS summary statistics that considers the direction of effect of shared SNPs, and further presented the utility of this DDN in prioritizing diseases according to comorbidity. To design this DDN, we measured the overall direction of effect of SNPs shared between pairs of diseases, and further categorized disease-disease associations as synergistic or antagonistic depending on whether that overall direction is consistent or opposite between diseases. Our results demonstrated that considering the interaction and direction of shared components is more helpful in predicting comorbidity and ranking comorbid diseases than considering only the quantity of shared components.
We also developed a novel label propagation algorithm for signed networks and applied it to show that the signed DDN not only models disease relationships at the population level, but also is applicable to comorbidity prediction in the context of personalized medicine. The comorbidity scoring algorithm was designed such that when the initial label on the index disease encounters a 'disease connected by a synergistic association', the score is propagated with a positive value (increasing the likelihood of co-occurrence), whereas when it encounters a 'disease connected by an antagonistic association', the score is propagated with a negative value (reducing the likelihood of co-occurrence).
One advantage of the proposed method is that the signed DDN can aid in the interpretation and comprehension of the intricate associations among diseases, and moreover can be easily updated and reinforced when new PheWAS results are obtained for a phenotype not yet included in the network. A second advantage is its ability to categorize diseases as having direct and inverse comorbidity based on the defined synergistic/antagonistic associations. It is very Note: An asterisk denotes a disease pair that were predicted as having direct comorbidity from DDN-driven comorbidity, but inverse comorbidity on the basis of EHR-driven comorbidity (RR, relative risk; 95% CI, 95% confidence interval; corr, correlation). The solid line indicates DDN-driven comorbidity scores (f) of the other 426 diseases in the dataset, sorted in descending order with respect to score on the x-axis. The y-axis represents comorbidity scores, rescaled to values between À1 and 1. (b) Sub-networks of diseases having high comorbidity (Tier-1) and low comorbidity (Tier-10) with coronary atherosclerosis important to know both which conditions have higher chance of co-occurrence (direct comorbidity) and which have lower chance of co-occurrence (inverse comorbidity) for a given disease. Such categorized relationships enable clinicians to construct comorbidity scenarios in advance, and can aid patient treatment planning. Our study also used co-occurrence and disease prevalence information obtained from UK Biobank hospitalization episodes to confirm and validate the DDN-driven predictions of direct/inverse comorbidity. Although omnidirectional validation has not been performed, our findings support that a DDN based on the relations of diseases and genetic components provides sufficient information for the prioritization and categorization of disease comorbidity.
Some aspects of this study remain to be investigated in future work. One of the primary limitations is that we utilized data from a single cohort of the UK Biobank. Since the UK Biobank mainly includes healthy participants, the constructed network might not capture disease prevalence in populations of different demographic characteristics. We also limited our analyses to those phenotypes occurring in both sexes. Some significant genetic components might vary by sex, and thus it is necessary to construct sex-specific signed DDNs and determine sex-specific comorbidity. Although our stratification of comorbidity groups is a product of a specific population, the approach can be extended for use in precision medicine to screen individuals at comorbidity risks.